for the actual genomes. This is because sequencing errors are hard to avoid and the
genomes of many organisms include repetitive sequences. However, assembly accuracy
can be increased by deep sequencing, paired-end sequencing, and the use of long reads
produced by PacBio and Oxford Nanopore Technologies. De novo genome assembly
has been widely used for different kinds of organisms, especially in metagenomics for the
assembly of bacterial, fungal, and viral genomes.
We can use either a greedy alignment approach or a graph-based approach for de novo
genome assembly. The greedy method works like multiple sequence alignment: it performs
pairwise alignments and merges overlapping reads to build up contigs. The graph-based
approach uses either overlap-layout-consensus (OLC) graphs or de Bruijn graphs. In overlap
graphs, reads are represented as nodes and the overlaps between reads as edges. Contigs are
built by finding a Hamiltonian path, which visits each node exactly once. De Bruijn graphs,
on the other hand, break the reads into k-mers, and the k-mers are then represented as nodes.
Contigs are formed by following an Eulerian path, which traverses each edge exactly once.
De Bruijn graphs are more computationally efficient than overlap graphs for large sets of
short reads, because an Eulerian path can be found in time linear in the number of edges,
whereas the Hamiltonian path problem is NP-complete. Assemblers that use de Bruijn graphs
include ABySS, for small and large genomes, and SPAdes, for small bacterial, fungal, and
viral genomes. SPAdes has several modules, such as a metagenomic module, a viral assembly
module, and a SARS-CoV-2 module. SPAdes can also assemble a genome from hybrid reads,
such as Illumina reads combined with PacBio or Oxford Nanopore reads.
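The de Bruijn idea can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not how ABySS or SPAdes are implemented: reads are error-free, come from one strand, and the genome has no repeated (k-1)-mers. It also assumes the common edge-centric formulation, in which each distinct k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The function names (build_de_bruijn, eulerian_path, assemble) are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return {prefix (k-1)-mer: [suffix (k-1)-mer, ...]}, one edge per distinct k-mer."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):  # sorted only so the result is deterministic
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm. Assumes (does not check) that an Eulerian path exists."""
    if not graph:
        raise ValueError("graph has no edges")
    out_deg = {u: len(vs) for u, vs in graph.items()}
    in_deg = defaultdict(int)
    for vs in graph.values():
        for v in vs:
            in_deg[v] += 1
    nodes = set(out_deg) | set(in_deg)
    # A path starts at the node with one more outgoing than incoming edge;
    # if there is none (the graph is a cycle), start at any node with edges.
    start = next(
        (n for n in sorted(nodes) if out_deg.get(n, 0) - in_deg.get(n, 0) == 1),
        min(graph),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: emit node on the way back
    return path[::-1]


def assemble(reads, k):
    """Spell the contig along the Eulerian path: first node plus last base of each next node."""
    path = eulerian_path(build_de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    toy_reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]  # made-up reads for illustration
    print(assemble(toy_reads, 4))  # prints ATGGCGTGCAA
```

Real assemblers add steps this sketch leaves out, such as handling reverse complements, removing error-induced tips and bubbles, and resolving repeats with paired-end information.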
After assembly, a genome can be assessed using both statistical and evolutionary
methods. The statistical method reports the number, lengths, and length distribution of
contigs. An assembly with few but long contigs indicates good quality. The metrics
used to describe statistical quality are N25, N50, N75, L25, L50, and L75. We can
compare the performance of assemblers using these metrics. The evolutionary assessment of an
assembly relies on the genomes of evolutionarily related species to identify the number
of expected genes present in the assembled genome. The completeness of the genome assembly
is judged by how many of these genes are identified as complete. For statistical quality
assessment, we can use QUAST, and for evolutionary assessment, we can use BUSCO.
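The Nx and Lx metrics can be computed directly from a list of contig lengths. The sketch below assumes the usual definitions: after sorting contigs from longest to shortest, Nx is the length of the contig at which the running total first reaches x% of the total assembly length, and Lx is the number of contigs needed to get there. The function name nx_lx and the contig lengths are hypothetical; this is an illustration, not QUAST's code.

```python
def nx_lx(contig_lengths, fraction=0.5):
    """Return (Nx, Lx) for the given fraction of total assembly length (0.5 gives N50, L50)."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    lengths = sorted(contig_lengths, reverse=True)  # longest contig first
    threshold = fraction * sum(lengths)
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= threshold:
            return length, count  # Nx is this contig's length, Lx is contigs used


if __name__ == "__main__":
    lengths = [80, 70, 50, 40, 30, 20, 10]  # made-up contig lengths, total 300
    print(nx_lx(lengths, 0.25))  # (80, 1)
    print(nx_lx(lengths, 0.50))  # (70, 2)
    print(nx_lx(lengths, 0.75))  # (40, 4)
```

For two assemblies of the same genome, a larger N50 together with a smaller L50 means the sequence is concentrated in fewer, longer contigs.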